60 research outputs found

    Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences

    Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Existing statistical methods are strained, however, when the collection contains remote sequences that align poorly to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences, including segments present in only a small fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members of existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and comparing the clustering performance with previous methods from the literature. We then applied the proposed method to a genome-wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigation of the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.

    Multi-task Deep Neural Networks in Automated Protein Function Prediction

    In recent years, deep learning algorithms have outperformed state-of-the-art methods in several areas, thanks to efficient methods for training and for preventing overfitting, advances in computer hardware, and the availability of vast amounts of data. The high performance of multi-task deep neural networks in drug discovery has drawn attention to deep learning algorithms in bioinformatics. Here, we proposed a hierarchical multi-task deep neural network architecture based on Gene Ontology (GO) terms as a solution to the protein function prediction problem, and investigated various aspects of the proposed architecture through several experiments. First, we showed that there is a positive correlation between the performance of the system and the size of the training datasets. Second, we investigated whether the level of GO terms in the GO hierarchy is related to their performance, and found no relation between the depth of a GO term in the hierarchy and its performance. In addition, we included all annotations in the training of a set of GO terms to investigate whether adding noisy data to the training datasets changes the performance of the system. The results showed that including less reliable annotations in the training of deep neural networks significantly increased the performance of poorly performing GO terms. We evaluated the performance of the system using a hierarchical evaluation method. The Matthews correlation coefficient was 0.75, 0.49 and 0.63 for the molecular function, biological process and cellular component categories, respectively. We showed that deep learning algorithms have great potential in protein function prediction. We plan to further improve DEEPred by including other types of annotations from various biological data sources, and to release DEEPred as an open-access online tool. Comment: 19 pages, 4 figures, 4 tables
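
The Matthews correlation coefficient reported in the abstract above can be computed from the four counts of a binary confusion matrix. The following is a minimal sketch for a single flat binary task, not the hierarchical evaluation the authors describe; the function name `mcc` and the example counts are illustrative assumptions, not values from the paper.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when a marginal sum is zero,
    # because the coefficient is undefined in that case.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for one GO term, for illustration only.
score = mcc(tp=80, tn=90, fp=10, fn=20)
print(round(score, 3))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it less sensitive to class imbalance than plain accuracy.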

    SELFormer: Molecular Representation Learning via SELFIES Language Models

    Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notations in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features. Comment: 22 pages, 4 figures, 8 tables

    Protein domain-based prediction of drug/compound–target interactions and experimental validation on LIM kinases

    Predictive approaches such as virtual screening have been used in drug discovery with the objective of reducing developmental time and costs. Current machine learning and network-based approaches have issues related to generalization, usability, or model interpretability, especially due to the complexity of target proteins’ structure/function, and bias in system training datasets. Here, we propose a new method “DRUIDom” (DRUg Interacting Domain prediction) to identify bio-interactions between drug candidate compounds and targets by utilizing the domain modularity of proteins, to overcome problems associated with current approaches. DRUIDom is composed of two methodological steps. First, ligands/compounds are statistically mapped to structural domains of their target proteins, with the aim of identifying their interactions. As such, other proteins containing the same mapped domain or domain pair become new candidate targets for the corresponding compounds. Next, a million-scale dataset of small molecule compounds, including those mapped to domains in the previous step, is clustered based on molecular similarity, and the domain associations are propagated to other compounds within the same clusters. Experimentally verified bioactivity data points, obtained from public databases, are meticulously filtered to construct datasets of active/interacting and inactive/non-interacting drug/compound–target pairs (~2.9M data points), and used as training data for calculating parameters of compound–domain mappings, which led to 27,032 high-confidence associations between 250 domains and 8,165 compounds, and a finalized output of ~5 million new compound–protein interactions. DRUIDom is experimentally validated by syntheses and bioactivity analyses of compounds predicted to target LIM-kinase proteins, which play critical roles in the regulation of cell motility, cell cycle progression, and differentiation through actin filament dynamics. We showed that LIMK-inhibitor-2 and its derivatives significantly block cancer cell migration through inhibition of LIMK phosphorylation and the downstream protein cofilin. One of the derivative compounds (LIMKi-2d) was identified as a promising candidate due to its action on resistant Mahlavu liver cancer cells. The results demonstrated that DRUIDom can be exploited to identify drug candidate compounds for intended targets and to predict new target proteins based on the defined compound–domain relationships. Datasets, results, and the source code of DRUIDom are fully available at: https://github.com/cansyl/DRUIDom
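
The domain-based propagation idea in the abstract above (a compound mapped to a domain gains every protein containing that domain as a candidate target) can be sketched as follows. This is a simplified illustration under stated assumptions, not the DRUIDom implementation: it handles single domains only, omits the statistical mapping and the cluster-based propagation, and the function name `propagate_targets` and all identifiers in the toy data are hypothetical.

```python
from collections import defaultdict

def propagate_targets(compound_domains, protein_domains):
    """Map each compound to the proteins that contain at least one of its associated domains."""
    # Invert protein -> domains into domain -> proteins for fast lookup.
    domain_proteins = defaultdict(set)
    for protein, domains in protein_domains.items():
        for domain in domains:
            domain_proteins[domain].add(protein)

    predictions = {}
    for compound, domains in compound_domains.items():
        targets = set()
        for domain in domains:
            # .get avoids creating empty entries for unseen domains.
            targets |= domain_proteins.get(domain, set())
        predictions[compound] = targets
    return predictions

# Hypothetical toy data: compound -> associated domains, protein -> contained domains.
compound_domains = {"C1": {"D1"}, "C2": {"D2", "D3"}}
protein_domains = {"P1": {"D1", "D2"}, "P2": {"D1"}, "P3": {"D4"}}

predictions = propagate_targets(compound_domains, protein_domains)
print(predictions)
```

Inverting the protein annotations once keeps the lookup per compound proportional to the number of its associated domains rather than to the number of proteins.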

    CROssBAR: comprehensive resource of biomedical relations with knowledge graph representations

    Systemic analysis of available large-scale biological/biomedical data is critical for studying biological mechanisms, and developing novel and effective treatment approaches against diseases. However, different layers of the available data are produced using different technologies and scattered across individual computational resources without any explicit connections to each other, which hinders extensive and integrative multi-omics-based analysis. We aimed to address this issue by developing a new data integration/representation methodology and applying it to construct a biological data resource. CROssBAR is a comprehensive system that integrates large-scale biological/biomedical data from various resources and stores them in a NoSQL database. CROssBAR is enriched with the deep-learning-based prediction of relationships between numerous data entries, which is followed by the rigorous analysis of the enriched data to obtain biologically meaningful modules. These complex sets of entities and relationships are displayed to users via easy-to-interpret, interactive knowledge graphs within an open-access service. CROssBAR knowledge graphs incorporate relevant genes-proteins, molecular interactions, pathways, phenotypes, diseases, as well as known/predicted drugs and bioactive compounds, and they are constructed on-the-fly based on simple non-programmatic user queries. These intensely processed heterogeneous networks are expected to aid systems-level research, especially to infer biological mechanisms in relation to genes, proteins, their ligands, and diseases.

    Crowdsourced mapping of unexplored target space of kinase inhibitors

    Despite decades of intensive search for compounds that modulate the activity of particular protein targets, a large proportion of the human kinome remains as yet undrugged. Effective approaches are therefore required to map the massive space of unexplored compound-kinase interactions for novel and potent activities. Here, we carry out a crowdsourced benchmarking of predictive algorithms for kinase inhibitor potencies across multiple kinase families tested on unpublished bioactivity data. We find the top-performing predictions are based on various models, including kernel learning, gradient boosting and deep learning, and their ensemble leads to a predictive accuracy exceeding that of single-dose kinase activity assays. We design experiments based on the model predictions and identify unexpected activities even for under-studied kinases, thereby accelerating experimental mapping efforts. The open-source prediction algorithms together with the bioactivities between 95 compounds and 295 kinases provide a resource for benchmarking prediction algorithms and for extending the druggable kinome. The IDG-DREAM Challenge carried out crowdsourced benchmarking of predictive algorithms for kinase inhibitor activities on unpublished data. This study provides a resource to compare emerging algorithms and prioritize new kinase activities to accelerate drug discovery and repurposing efforts.

    Effects of hydrogen peroxide, gallic acid and resveratrol on the growth and catalase production of Aspergillus fumigatus

    The aim of this study was to analyze the effect of hydrogen peroxide and selected phenolic compounds on the growth and catalase production of Aspergillus fumigatus. Growing A. fumigatus at different temperatures showed that growth and catalase production of this species were highest at 37 °C. Catalase production was highest in the presence of 1 mM H2O2, a significant 3-fold increase with respect to the control. Biomass also increased 1.44-fold with respect to the control sample. H2O2 increased catalase production possibly by inducing oxidative stress, as biomass production significantly increased after the depletion of H2O2. Both gallic acid and trans-resveratrol significantly enhanced biomass generation of A. fumigatus (1.17-fold increase at 10 mM gallic acid and 1.45-fold increase at 3 mM resveratrol with respect to controls) and decreased extracellular catalase production (4.33-fold decrease at 25 mM gallic acid and 16.7-fold decrease at 3 mM resveratrol with respect to controls), especially in the first 5 or 6 days of cultivation, when the antioxidant activity of the compounds was possibly at its maximum. A sudden and significant rise in extracellular catalase activity was observed between the 5th and 7th days of cultivation in the samples treated with phenolic compounds, possibly owing to depletion of the antioxidant activity of gallic acid and resveratrol, followed by the fungal cells’ response to a sudden increase in oxidative stress by boosting catalase production. M.S. - Master of Science

    Minstrelsy tradition in Gebze

    This master's thesis, titled “Gebze’de Âşıklık Geleneği” (“Minstrelsy Tradition in Gebze”), first addresses the historical background, geographical features, demographic and economic structure, and sociocultural structure of Gebze. It then explains the definition and content of minstrelsy and the course of the minstrelsy tradition from its origins to the present day. The thesis covers the migrations to Gebze over time and the transfer of the tradition with them; the forms in which the tradition of Kars, Erzurum and Ardahan, which gave rise to the minstrelsy tradition in Gebze, is performed there; the settings in which minstrels perform in Gebze; the ways they make use of electronic cultural media; and the changes the tradition has undergone in connection with those media. In addition, data were obtained through fieldwork compilations on how and to what extent the elements of the tradition are being carried into the future, the reasons for migration to Gebze and their effects on the tradition, and the extent to which minstrels benefit from the Gebze Minstrels, Poets and Men of Letters Society (Gebze Âşıklar, Şairler, Ozanlar ve Edebiyatçılar Derneği) and the Gedikler Culture House (Gedikler Kültür Evi). The results are examined from a folkloristic perspective.

    HPO2GO: prediction of human phenotype ontology term associations for proteins using cross ontology annotation co-occurrences

    Analysing the relationships between biomolecules and genetic diseases is a highly active area of research, where the aim is to identify the genes and their products that cause a particular disease due to functional changes originating from mutations. Biological ontologies are frequently employed in these studies, which provide researchers with extensive opportunities for knowledge discovery through computational data analysis. In this study, a novel approach is proposed for the identification of relationships between biomedical entities by automatically mapping phenotypic abnormality defining HPO terms with biomolecular function defining GO terms, where each association indicates the occurrence of the abnormality due to the loss of the biomolecular function expressed by the corresponding GO term. The proposed HPO2GO mappings were extracted by calculating the frequency of the co-annotations of the terms on the same genes/proteins, using already existing curated HPO and GO annotation sets. This was followed by the filtering of the unreliable mappings that could be observed due to chance, by statistical resampling of the co-occurrence similarity distributions. Furthermore, the biological relevance of the finalized mappings was discussed over selected cases, using the literature. The resulting HPO2GO mappings can be employed in different settings to predict and to analyse novel gene/protein–ontology term–disease relations. As an application of the proposed approach, HPO term–protein associations (i.e., HPO2protein) were predicted. In order to test the predictive performance of the method on a quantitative basis, and to compare it with the state-of-the-art, the CAFA2 challenge HPO prediction target protein set was employed. The results of the benchmark indicated the potential of the proposed approach, as HPO2GO performance was among the best (Fmax = 0.35). The automated cross-ontology mapping approach developed in this work may be extended to other ontologies as well, to identify unexplored relation patterns at the systemic level. The datasets, results and the source code of HPO2GO are available for download at: https://github.com/cansyl/HPO2GO
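
The co-annotation counting step described in the abstract above can be illustrated with a short sketch. It counts, for every HPO term and GO term pair, the number of genes annotated with both. This is an assumption-laden illustration rather than the HPO2GO code: the function name `co_annotation_counts`, the dictionary layout, and the placeholder term identifiers are hypothetical, and the statistical resampling filter is not included.

```python
from collections import Counter
from itertools import product

def co_annotation_counts(hpo_annotations, go_annotations):
    """Count the genes on which each (HPO term, GO term) pair is co-annotated."""
    counts = Counter()
    # Only genes present in both annotation sets can contribute a co-annotation.
    for gene in hpo_annotations.keys() & go_annotations.keys():
        for pair in product(hpo_annotations[gene], go_annotations[gene]):
            counts[pair] += 1
    return counts

# Hypothetical toy annotations: gene -> set of ontology terms (placeholder IDs).
hpo_annotations = {"geneA": {"HP:x"}, "geneB": {"HP:x", "HP:y"}, "geneC": {"HP:y"}}
go_annotations = {"geneA": {"GO:p"}, "geneB": {"GO:p", "GO:q"}}

counts = co_annotation_counts(hpo_annotations, go_annotations)
print(counts.most_common())
```

In this toy example the pair ("HP:x", "GO:p") is supported by two genes, so it would be the strongest mapping candidate before any chance-based filtering is applied.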

    2-D thresholding of the connectivity map following the multiple sequence alignments of diverse datasets

    10th IASTED International Conference on Biomedical Engineering, BioMed 2013; Innsbruck; Austria; 13 February 2013 through 15 February 2013
    Multiple sequence alignment (MSA) is a widely used method to uncover the relationships between biomolecular sequences. One essential prerequisite for applying this procedure is a considerable amount of similarity between the test sequences. It is usually not possible to obtain reliable results from the multiple alignments of large and diverse datasets. Here we propose a method to obtain sequence clusters with significant intragroup similarities and to make sense of multiple alignments containing remote sequences. This is achieved by thresholding the pairwise connectivity map over two parameters. The first is the inferred pairwise evolutionary distance and the second is the number of gapless positions in the pairwise comparisons of the alignment. Threshold curves are generated from the statistical parameter values obtained from a shuffled dataset, and probability distribution techniques are employed to select an optimum threshold curve that eliminates as many of the unreliable connectivities as possible while keeping the reliable ones. We applied the method to a large and diverse dataset composed of nearly 18000 human proteins and measured the biological relevance of the recovered connectivities. Our precision measure (0.981) was nearly 20% higher than the one for the connectivities left after a classical thresholding procedure, a significant improvement. Finally, we employed the method for the functional clustering of protein sequences in a gold standard dataset. We also measured the performance, obtaining a higher F-measure (0.882) compared to a conventional clustering operation (0.827).
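
The two-parameter thresholding described in the abstract above can be sketched as a filter that keeps a pairwise connection only when its evolutionary distance lies at or below a curve evaluated at its number of gapless positions. This is a minimal sketch under assumptions: the function names, the data layout, and especially the shape of `toy_curve` are hypothetical, whereas the abstract says the actual curve is derived from a shuffled dataset.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[str, str]

def threshold_2d(
    connections: Dict[Pair, Tuple[float, int]],
    max_distance_for: Callable[[int], float],
) -> Dict[Pair, Tuple[float, int]]:
    """Keep pairs whose distance is at or below the curve value for their gapless length."""
    return {
        pair: (dist, gapless)
        for pair, (dist, gapless) in connections.items()
        if dist <= max_distance_for(gapless)
    }

# Hypothetical threshold curve: longer gapless overlaps tolerate larger distances.
def toy_curve(gapless: int) -> float:
    return min(2.0, 0.01 * gapless)

# Toy connectivity map: pair -> (evolutionary distance, gapless positions).
connections = {
    ("seqA", "seqB"): (0.4, 120),  # kept: distance well below the curve
    ("seqA", "seqC"): (1.5, 60),   # dropped: distance above the curve
    ("seqB", "seqC"): (0.3, 50),   # kept: distance below the curve
}

kept = threshold_2d(connections, toy_curve)
print(sorted(kept))
```

A classical one-parameter threshold corresponds to passing a constant function as `max_distance_for`, which is one way to compare the two procedures on the same connectivity map.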